QTM 447 Lecture 20: Generative Models: Autoregressives and Autoencoders

Kevin McAlister

March 27, 2025

\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]

Generative Models

We’ll dig into generative models more over the final part of this class.

Just a quick overview today

  • Start digging into the main methods for images and video starting next week

Generative Models

Recall that all machine learning is about learning a probability distribution

Most commonly supervised learning

  • Given input/output pairs, learn a model that finds the function that maps \(\mathbf x \rightarrow \mathbf y\) for all \(\mathbf x\) that we could see.

Under some light assumptions:

\[ f(\mathbf x) = P(y | \mathbf x) \]

Generative Models

There’s also unsupervised learning

Different goal: Learn structure within the input under set of assumptions

Given a set of observations, \(\mathbf x_i \in \mathbf X\), learn about the structure of the inputs

\[ f(\mathbf x) = P(\mathbf x) \]

Generative Models

These methods are designed to learn structure in an input corpus

  • Clustering - find groups of inputs that are similar

  • PCA - project high dimensional data to a lower dimensional subspace that retains information

  • Density estimation - learn the structure of the data in terms of likelihood

Generative Models

Why learn \(f(\mathbf x)\) instead of \(f(y | \mathbf x)\)?

  • Structure: Choose an appropriate set of customers for different strategies based on clusters. (K-Means)

  • Description: Visualize a high dimensional feature set in 2 or 3 dimensions without losing info. (PCA)

  • Latent Variables: Find hidden structure within the data that corresponds to important concepts (Factor Analysis)

  • Prototypes: Find the most dog of all dogs and design supervised learning frameworks to match this info (Semi-supervised learning)

Generative Models

Another purpose. Dialogue from The Wire:

“I got the shotgun. You got the briefcase.”

“You ain’t a pawn in this game. You’re the whole damn chessboard.”

Generative Models

Another purpose. Dialogue from The Wire:

“I got the shotgun. You got the briefcase.” (Real)

“You ain’t a pawn in this game; you’re the whole damn chessboard.” (Fake by ChatGPT)

Generative Models

(Video: MyFirstMovie.mp4)
Generative Models

Generative Unsupervised Models learn from a set of inputs (text, images, tabular data)

  • Synthesize new data examples that are statistically indistinguishable from the training data

The general strategy:

Assume that each training example is an i.i.d. draw from a true data generating distribution, \(\mathbf x_i \sim \mathbf X\)

  • For text, over all words

  • For images, over all pixels

Generative Models

Given samples, pick the parameters, \(\boldsymbol \theta\), from a family of model distributions that minimize some notion of distance between the data we see and the model distribution

  • Sound like anything we’ve seen (minimize the divergence, maximize the similarity)?

Then sample from the data generating distribution to get new viable feature values!

  • Get a new picture

  • Get a new set of words
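A minimal numeric sketch of this fit-then-sample strategy, using a hypothetical one-dimensional Gaussian family (where minimizing the divergence to the data reduces to maximum likelihood, which has a closed form):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "true" data generating distribution: N(2, 1.5^2)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)

# Model family: N(mu, sigma^2).  Maximum likelihood (equivalently,
# minimizing KL divergence to the empirical distribution) gives:
mu_hat = data.mean()
sigma_hat = data.std()

# Generation: sample new viable feature values from the fitted model
new_points = rng.normal(mu_hat, sigma_hat, size=5)
print(mu_hat, sigma_hat)
```

The same two steps - pick parameters that match the data, then draw from the fitted distribution - are what every generative model in this lecture is doing, just with far richer families.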

Generative Models

This is actually the structure for common unsupervised learning models!

K-Means Clustering

  • Assume each \(\mathbf x_i\) conditioned on a latent class label follows a multivariate normal distribution

    \[ f(\mathbf x_i | c_i = g) = \mathcal N_P(\mathbf x_i | \boldsymbol \mu_g , \boldsymbol \Sigma_g) \]

(Probabilistic) PCA

  • Assume each \(\mathbf x_i\) is generated as the product of a weight matrix, \(\underset{(P \times K)}{\boldsymbol \Lambda}\), and a latent variable, \(\underset{(K \times 1)}{\mathbf z_i}\)

    \[ f(\mathbf x_i | \boldsymbol \Lambda , \mathbf z_i) = \mathcal N_P(\mathbf x_i | \boldsymbol \Lambda \mathbf z_i , \mathcal I_P) \]

Generative Models

For a new value of the latent variable ( \(c\) or \(\mathbf z\)), generate a new output as a draw from the model distribution

  • A new observation within the cluster dictated by \(c\)

  • A new observation given a latent vector, \(\mathbf z\), drawn from the overall multivariate normal!

That’s generation!
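A quick sketch of this generation step for the probabilistic PCA model above, with a made-up weight matrix standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

P, K = 5, 2                       # feature dim, latent dim
Lambda = rng.normal(size=(P, K))  # weight matrix (learned from data in practice)

# Draw a new latent vector from the overall multivariate normal...
z = rng.standard_normal(K)

# ...and decode it into a brand-new observation in feature space
x_new = rng.multivariate_normal(mean=Lambda @ z, cov=np.eye(P))
print(x_new.shape)
```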

Generative Models

Note that our old methods, more or less, require that the data generating distribution be multivariate normal

  • Common mean

  • Common covariance matrix between features (pixel values/words)

Does this seem like a viable assumption?

Generative Models

Images and text have complex dependencies

  • Neighborhood effects

  • Context

Do you think that a multivariate normal distribution that is a linear combination of weights and hidden variables can effectively model the joint distribution of images?

Generative Models

Also, incredibly high dimensional

The iPhone 15 claims to be able to take pictures that are 6000 × 4000 pixels - super high clarity!

  • RGB means three channel values per pixel, each an integer between 0 and 255

The total number of possible images we could see

\[ 256^{6000 \times 4000 \times 3} = 256^{72 \text{ million}} \]

  • That’s a lot of images!
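For a sense of scale, the number of decimal digits in that count can be computed directly:

```python
import math

# Number of decimal digits in 256**(6000 * 4000 * 3)
n_values = 6000 * 4000 * 3                       # 72 million channel values
digits = math.floor(n_values * math.log10(256)) + 1
print(digits)  # on the order of 1.7e8 digits -- the *count* itself is astronomical
```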

DALL-E seems to be able to generate anything we want!

  • Does it uncover a distribution over all viable images?

Generative Models

Fortunately, the world is structured!

  • We aren’t just a random collection of different colors

  • I’d be willing to venture that I’m the only person in the world right now typing the text “Peanut Butter, Egg, Dirt”

Some things are just unlikely to be viable outputs

  • The goal is to mimic the input which is structured!

Generative Models

Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)

Success:

  • Density estimation: Given a proposed data point, \(\mathbf x_i\), what is the probability with which we could expect to see that data point? Don’t generate data points that have low probability of occurrence!

  • Sampling: How can we generate novel data from the model distribution? We should be able to sample from the distribution!

  • Representation: Can we learn meaningful feature representations from \(\mathbf x\)? Do we have the ability to exaggerate certain features?

All methods we’ll talk about can be sampled!

  • Differences in the ability to do the other 2

Generative Models

A final broad point:

We can also use these frameworks to learn the joint density of the inputs and any labels to do conditional generation

\[ P(\mathbf x | y) = \frac{P(y | \mathbf x) P(\mathbf x)}{P(y)} \]

If we learn the joint density of the inputs and any labels, then we can condition on the label

  • Dog vs. Brown dog

Train a classifier, multiply by joint input probability to get posterior

  • We’ll just assume we can do this unless it’s relevant to differentiate between marginal and conditional
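A toy discrete version of this classifier-times-prior recipe (all probabilities made up for illustration):

```python
import numpy as np

# Discrete toy version of P(x | y) = P(y | x) P(x) / P(y):
# x ranges over 4 "images", y is the label "dog".
p_x = np.array([0.4, 0.3, 0.2, 0.1])          # marginal input probability P(x)
p_y_given_x = np.array([0.9, 0.1, 0.8, 0.0])  # classifier output P(y | x)

p_y = (p_y_given_x * p_x).sum()               # normalizer P(y)
p_x_given_y = p_y_given_x * p_x / p_y         # Bayes' rule

print(p_x_given_y)  # sums to 1; mass shifts toward x's the classifier likes
```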

Autoregressive Generation

Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)

  • A collection of images

  • A collection of sentences

The easiest way to approach this is with a common probability identity.

Let \(\mathbf x\) be a vector of inputs - \([x_1,x_2,....,x_P]\)

The joint density over inputs is then:

\[ P(\mathbf x) = f(x_1,x_2,x_3,...,x_P) \]

Autoregressive Generation

The probability chain rule:

\[ f(x_1,x_2,x_3,...,x_P) = f(x_1)f(x_2 | x_1)f(x_3 | x_1,x_2)...f(x_P | x_{P-1}, x_{P-2},...) \]

If it is possible to learn the conditional density of the next input given the previous inputs, then we’ve learned the joint density of the inputs!

Does this look like anything we’ve recently talked about?
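Here is a tiny illustration of the chain rule with a hypothetical character bigram model, where each conditional only looks one step back:

```python
import numpy as np

# Character bigram model: a one-step autoregressive factorization,
# P(x) = P(x_1) * prod_t P(x_t | x_{t-1})
vocab = ["a", "b", "c"]
idx = {ch: i for i, ch in enumerate(vocab)}

p_first = np.array([0.5, 0.3, 0.2])              # P(x_1)
p_next = np.array([[0.1, 0.6, 0.3],              # P(x_t | x_{t-1} = a)
                   [0.4, 0.2, 0.4],              # P(x_t | x_{t-1} = b)
                   [0.3, 0.3, 0.4]])             # P(x_t | x_{t-1} = c)

def log_prob(seq):
    """Joint log-probability of a string via the chain rule."""
    lp = np.log(p_first[idx[seq[0]]])
    for prev, cur in zip(seq, seq[1:]):
        lp += np.log(p_next[idx[prev], idx[cur]])
    return lp

print(log_prob("abc"))  # log(0.5 * 0.6 * 0.4)
```

Learning each conditional gives you the full joint - the same factorization GPT uses, just with much longer contexts.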

Autoregressive Generation

Generative Pretrained Transformers are autoregressive generators

  • Find the weights for masked self-attention that maximize the probability that we generate the correct next word

  • Learns \(P(\mathbf x)\) instead of \(P(y | \mathbf x)\)!

  • Technically, provides \(\underset{\mathbf x}{\text{argmax}} P(\mathbf x)\)

Using GPT for classification is kind of a square peg in a round hole

  • A lot like K-Means + Logistic Regression for classification

  • Semi-supervised learning

Autoregressive Generation

ChatGPT is trained on a massive swath of the English language

  • All it’s really doing is learning the joint distribution over English text!

Conditional language generation follows the same principle:

Let \(\mathbf y\) be the prompt (a sequence of tokens)

\[ P(\mathbf x | \mathbf y) = f(x_1 | \mathbf y)f(x_2 | \mathbf y , x_1)... \]
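A sketch of the resulting sampling loop; `next_token_dist` is a hypothetical stand-in for the trained conditional model (here just uniform, for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["the", "dog", "ran", "<eos>"]

def next_token_dist(prompt, tokens):
    # Stand-in for a trained model (e.g. a masked self-attention transformer):
    # returns P(x_t | y, x_1, ..., x_{t-1}).  Uniform here, for illustration.
    return np.full(len(vocab), 1.0 / len(vocab))

def generate(prompt, max_len=10):
    tokens = []
    for _ in range(max_len):
        probs = next_token_dist(prompt, tokens)
        tok = vocab[rng.choice(len(vocab), p=probs)]
        if tok == "<eos>":           # stop token ends the sequence
            break
        tokens.append(tok)
    return tokens

print(generate("a dog story"))
```

Conditioning on the prompt just means every conditional in the chain rule also gets \(\mathbf y\) as an input.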

Autoregressive Generation

At this point in time, this is the only real dog in the text generation fight

  • Easy to understand if you understand transformers

  • This one-ahead approach seems to work remarkably well for language generation

  • I have a hard time thinking about the next frontier in text generation - just improvements to the GPT training process

Autoregressive Generation

Can we generate images this way?

Sorta!

The first proposed method was the PixelRNN (van den Oord et al., 2016)

  • Assume images are generated as rectangular pixel grids

  • Start in the top left corner, \(x_{11}\)

  • Take a draw from two conditional densities to get next two pixels, \(f(x_{12} | x_{11})\) and \(f(x_{21} | x_{11})\)

  • Get the next 4 pixels

  • Then 8…

Autoregressive Generation

The probability model uses the LSTM architecture.

Assume each RGB vector for each pixel depends on a hidden state that is determined recurrently from its neighbors

\[ \mathbf h_{x,y} = f(\mathbf h_{x-1,y}, \mathbf h_{x,y-1}) \]

  • Recurrently unravel the image

Train over a large corpus to learn all of the hidden state transitions!
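A simplified version of this recurrence, with a plain tanh cell standing in for the LSTM and a constant dummy input (sizes and weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
H, W, D = 4, 4, 8                 # image height/width, hidden-state size

W_left = rng.normal(scale=0.1, size=(D, D))
W_up = rng.normal(scale=0.1, size=(D, D))

# h[x, y] depends on the hidden states of the two neighbors,
# so every pixel implicitly depends on all previously generated pixels.
h = np.zeros((H, W, D))
for x in range(H):
    for y in range(W):
        left = h[x - 1, y] if x > 0 else np.zeros(D)
        up = h[x, y - 1] if y > 0 else np.zeros(D)
        h[x, y] = np.tanh(W_left @ left + W_up @ up + 1.0)  # +1.0: dummy input

print(h.shape)
```

In the real PixelRNN the cell is an LSTM and each \(\mathbf h_{x,y}\) parameterizes the conditional distribution over that pixel's RGB values.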

Autoregressive Generation

Implicitly depends on all previously seen pixels!

Autoregressive Generation

This works quite well in some situations!

Autoregressive Generation

What do you think are the drawbacks to this approach for image generation?

Autoregressive Generation

Advancements in recurrent image generation:

  • PixelCNN - exchange recurrence for CNN-style windowing; a little faster

  • ImageGPT - exchange recurrence for masked self-attention; a little faster with really large training sets

These do a great job when we can compute them in finite time!

  • Explicit density that can easily sample new images

Autoregressive Generation

Painfully slow!!!

  • ImageGPT has only been shown to work for up to 64 x 64 input images

  • Literal days of training time to learn joint densities for small-ish data sets

  • Can’t be used generally to create a dictionary of all images!

Autoregressive Generation

What autoregressive methods do well:

  • Explicit density estimation - given a starting prompt/word, we can compute the joint probability of the next word directly

  • Fast at generation (for text)

What autoregressive methods don’t do well:

  • Images (too high dimensional)

  • No feature representation

Autoregressive Generation

What can we do to generate images?

  • We’ll need different machinery.

The rest of the class will be on image/video generation!

  • These methods can be used for text (or other 1d sequential data like sound waves)

  • Autoregressives are just the dominant generation method at this point in time!

Also, very useful for descriptive purposes!

PCA

A classic in the unsupervised literature is principal components analysis.

A more general statement:

Given a corpus of inputs, find a low dimensional representation of the inputs that minimizes reconstruction error

\[ \frac{1}{N} \sum \limits_{i = 1}^N \| \mathbf x_i - d(e(\mathbf x_i))\|^2_2 \]

  • \(e(\mathbf x_i) = \mathbf z_i\) is an encoder function that maps the input to a latent space

  • \(d(\mathbf z_i)\) is a decoder function that maps the latent variable back to original feature space

PCA

How we choose to find \(e()\) and \(d()\) dictates the method

PCA

Given a \(N \times P\) matrix of input features, find a single weight matrix, \(\mathbf W\), with column rank \(K\) that minimizes the objective function:

\[ \frac{1}{N} \sum \limits_{i = 1}^N \| \mathbf x_i - \mathbf W \mathbf W^T \mathbf x_i\|^2_2 \]

  • If the input features are correlated, then we don’t need to see them all individually to reconstruct the input

  • Find a lower dimensional representation of the input vector that retains most of the information

PCA

\(\mathbf W\) is a \(P \times K\) matrix of weights where \(K \ll P\) that controls both the mapping of the input to the latent space and the mapping of the latent space back to the input space

\(\mathbf z_i\) is a \(K\)-dimensional latent representation of the input that is derived through a linear mapping - \(\mathbf W^T \mathbf x_i\)

\[ \frac{1}{N} \sum \limits_{i = 1}^N \| \mathbf x_i - \mathbf W \mathbf z_i \|^2_2 \]

  • In theory, \(\mathbf z_i\) embeds the inputs in an interpretable latent space that combines all of the redundant info that we see in the input space!

Two additional constraints:

  • The columns of \(\mathbf W\) should be orthogonal directions

  • The columns of \(\mathbf W\) should be normalized

PCA

PCA operates by using the eigenvalues and eigenvectors of the square covariance matrix for the centered feature matrix:

\[ \mathbf X^T \mathbf X = \mathbf Q \mathbf D \mathbf Q^{-1} \]

where \(\mathbf Q\) is the collection of \(P\) eigenvectors ordered by eigenvalue (from large to small)

  • All of the eigenvalues will be nonnegative since the covariance matrix is positive semi-definite

To create a rank \(K\) approximation of the original feature matrix, set \(\mathbf W = \mathbf Q_{1:K}\)

  • Take the first \(K\) eigenvectors
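This recipe in code, on made-up correlated data (`numpy.linalg.eigh` returns eigenvalues in ascending order, so we re-sort from large to small):

```python
import numpy as np

rng = np.random.default_rng(4)

# Correlated toy data: 200 observations, 5 features, rank-2 signal
N, P, K = 200, 5, 2
Z_true = rng.normal(size=(N, K))
X = Z_true @ rng.normal(size=(K, P)) + 0.1 * rng.normal(size=(N, P))
X = X - X.mean(axis=0)                       # center the features

# Eigendecomposition of the (scaled) covariance matrix
evals, Q = np.linalg.eigh(X.T @ X)
order = np.argsort(evals)[::-1]              # sort large -> small
W = Q[:, order[:K]]                          # first K eigenvectors

Z = X @ W                                    # latent scores
X_hat = Z @ W.T                              # rank-K reconstruction
err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(err)  # small: most variance lives in the first K directions
```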

PCA

What’s neat about PCA is that it produces the best orthogonal rank-K approximation to the input under linearity and squared reconstruction error

  • No other method (under these constraints) will do better!

  • This optimality result is a consequence of the Eckart–Young theorem

PCA

Why this construction?

  • It admits a nice analytical solution with good properties.

  • Can be solved via a basic computer that can find eigenvectors

Computational constraints led to the solution - no other particularly good reason!

PCA

Using PCA, we can:

  • Get rid of redundant features

  • Reduce dimensionality

  • Parse away noise

  • Pass latent scores, \(\mathbf Z\), to supervised learning procedures

PCA

Additionally, the dimensions of the latent space correspond to factors of variation within the training set!

  • Create a feature map for the original input data
  • Feature map corresponds to something about the inputs - perhaps a more abstract notion of features (curviness, thickness, etc.)

A key component of the data scientist’s toolkit!

PCA

Problem: Linearity and strictly invertible mappings are restrictive

Think of PCA as creating a mapping from the feature space to a latent space and then inverting the mapping to return an approximation to the original feature space

  • The latent space - in theory - includes all of the necessary information needed to reconstruct \(\mathbf X\) in a lower dimensional subspace

PCA seeks to solve the generic problem of reconstruction:

\[ \|\mathbf X - d(e(\mathbf X))\|^2_2 \]

  • Other than computational nicety, there’s no reason we can’t find a low dimensional reconstruction using nonlinear methods

Deterministic Autoencoders

This is referred to as a deterministic bottleneck autoencoder

  • Learn a set of encoder and decoder functions that map \(\mathbf X\) to itself!

  • Restriction: each input instance passes through a low-dimensional bottleneck ( \(K \ll P\) )

  • The network can’t just copy the input through, so it has to set up \(\mathbf Z\) to represent as much of the variation in \(\mathbf X\) as possible

Note that PCA is a special case!

  • Restrict the feedforward layers to have linear activations and be invertible!

No good reason to do this when we have autodiff that can solve for arbitrarily complex models
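A sketch of the "PCA is a special case" point: a tied-weight linear autoencoder trained by plain gradient descent drives the reconstruction error toward PCA's optimum (toy data; the step size and iteration count are ad hoc):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data with redundant features: rank-2 structure in 6 dimensions
N, P, K = 500, 6, 2
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, P))
X = X - X.mean(axis=0)

# Tied-weight linear autoencoder: e(x) = W^T x, d(z) = W z.
# With linear activations, minimizing squared reconstruction error
# recovers the same subspace as PCA.
W = 0.01 * rng.normal(size=(P, K))
lr = 0.005 / N
for _ in range(5000):
    R = X @ W @ W.T - X                     # reconstruction residual
    grad = 2 * (X.T @ R + R.T @ X) @ W      # gradient of ||X - X W W^T||^2
    W -= lr * grad

err = np.mean(np.sum((X - X @ W @ W.T) ** 2, axis=1))
print(err)  # reconstruction error after training
```

Swapping the linear maps for deep feedforward or CNN layers (and the hand-written gradient for autodiff) gives the general bottleneck autoencoder.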

Deterministic Autoencoders

How do you choose the size of the bottleneck?

  • This largely echoes the same discussion for PCA

  • Just choose a small number and see how well we’re able to recover the images

  • Stop adding dimensions when the next one doesn’t reduce the loss too much

  • Validation error could choose it in principle, but that requires serious tuning and is typically skipped.

  • In practice: 2 (for easy visualization) or a power of two (16/32/64/128/etc.) for images

Deterministic Autoencoders

A few notes:

  • Everyone understands PCA, not necessarily autoencoders. Follow disciplinary norms.

  • For images, the CNN backbone helps to induce structure in the recovered images! The reconstruction loss is about the same as a deep feedforward autoencoder in a lot of situations, but it helps to find structure for recovery.

  • For images, autoencoders strictly dominate PCA!

Deterministic Autoencoders

What can we do with deterministic autoencoders?

Like transformers, split for different tasks.

Encoder ( \(\mathbf X \rightarrow \mathbf Z\) )

  • Use as you would for PCA. Pass lower dimensional representation to supervised model for better predictive performance.

  • Use to examine feature maps to understand sources of shared variation within input features.

Deterministic Autoencoders

The decoder is the interesting part!

Decoder ( \(\mathbf Z \rightarrow \mathbf X\) ):

  • In theory, \(\mathbf Z\) shares most of the same information as \(\mathbf X\).

  • Lower dimensional representation means more manageable.

  • The decoder dictates a mapping from \(\mathbf Z\) to \(\mathbf X\)

Any thoughts?

Deterministic Autoencoders

Given an input \(\mathbf z\), we could use that to translate back up to the feature space!

How do we choose \(\mathbf z\) to be viable?

  • Random dart throws?

  • Pick a point in the convex hull of the latent space defined by the training data?

Deterministic Autoencoders

Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)

Success:

  • Density estimation: Given a proposed data point, \(\mathbf x_i\), what is the probability with which we could expect to see that data point? Don’t generate data points that have low probability of occurrence!

  • Sampling: How can we generate novel data from the model distribution? We should be able to sample from the distribution!

  • Representation: Can we learn meaningful feature representations from \(\mathbf x\)? Do we have the ability to exaggerate certain features?

Deterministic Autoencoders

This approach doesn’t generate viable images

This isn’t a proper probabilistic model!

  • There isn’t ever an expression of \(P(\mathbf X)\)

  • The method can’t learn how to fill in the gaps between digits appropriately because we never asked it to do that!

We can’t say the probability with which we would expect to see a data point given the data.

We can’t sample from the resulting approximation to \(P(\mathbf X)\) because there isn’t a distribution.

Generative PCA

A probabilistic version of PCA (sort of kinda) is called factor analysis

The construction is pretty similar - \(P\) dimensional feature space to \(K\) dimensional latent space, \(\mathbf Z\)

The difference is that we give structure to the latent space by making a prior assumption

  • Not that consequential since the latent space is to be learned from the data.

  • Just determines what the latent space will look like.

Generative PCA

Most common version - Gaussian factor analysis

\[ f(\mathbf z) = \mathcal N_K(\mathbf z | \mathbf 0 , \mathcal I_K) \]

\[ f(\mathbf x | \mathbf z , \boldsymbol \Theta) = \mathcal N_P(\mathbf x | \mathbf W \mathbf z + \boldsymbol \alpha , \boldsymbol \Psi) \]

where \(\mathbf W\) is a \(P \times K\) matrix of weights, \(\mathbf z\) is a \(K\)-vector of latent values, \(\boldsymbol \alpha\) is an intercept term, and \(\boldsymbol \Psi\) is a \(P \times P\) covariance matrix.
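Ancestral sampling from this generative model, with made-up parameter values in place of learned ones:

```python
import numpy as np

rng = np.random.default_rng(6)

P, K = 4, 2
W = rng.normal(size=(P, K))       # loadings (learned from data in practice)
alpha = np.zeros(P)               # intercept
Psi = 0.1 * np.eye(P)             # noise covariance (usually diagonal)

# Ancestral sampling: draw the latent first, then the observation
z = rng.standard_normal(K)                         # z ~ N(0, I_K)
x = rng.multivariate_normal(W @ z + alpha, Psi)    # x | z ~ N(Wz + alpha, Psi)

# Marginally, x ~ N(alpha, W W^T + Psi): a structured low-rank covariance
print(x.shape)
```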

Generative PCA

What does this give us?

Like PCA, the goal is to find \(\mathbf W\) and \(\mathbf Z\) that minimize the reconstruction error given the training data.

But, we’re going to show that this is equivalent to estimating a Bayesian model

  • And posterior distributions represent probabilistic distributions over the latent space

  • i.e., we’re properly estimating a low-rank approximate distribution over the input features that can be used to viably generate new feature sets!

Then, port this generative model to the autoencoder framework to get the variational autoencoder